Abstract - We provide a simple physical interpretation, in the
context of the second law of thermodynamics, to the information 
inequality (a.k.a. the Gibbs' inequality, which is also equivalent 
to the log-sum inequality), asserting that the relative entropy 
between two probability distributions cannot be negative. Since 
this inequality stands at the basis of the data processing theorem 
(DPT), and the DPT in turn is at the heart of most, if not all, 
proofs of converse theorems in Shannon theory, it is observed 
that conceptually, the roots of fundamental limits of Information 
Theory can actually be attributed to the laws of physics, in 
particular, the second law of thermodynamics, and indirectly,
also the law of energy conservation. By the same token, in the 
other direction: one can view the second law as stemming from 
information-theoretic principles. 

Index Terms — Gibbs' inequality, data processing theorem, 
entropy, second law of thermodynamics, divergence, relative 
entropy, mutual information. 

I. Introduction 

While the laws of physics draw the boundaries between the 
possible and the impossible in Nature, the coding theorems of 
Information Theory, or more precisely, their converse parts, 
draw the boundaries between the possible and the impossible 
in the design and performance of coded communication sys- 
tems and in data processing. A natural question that may arise, 
in view of these two facts, is whether there is any relationship 
between them. It is the purpose of this work to touch upon this 
question and to make an attempt to provide at least a partial 
answer. 

Perhaps the most fundamental inequality in Information
Theory is the so-called information inequality (cf. e.g., [1,
Theorem 2.6.3, p. 28]), which asserts that the relative entropy
(a.k.a. the Kullback-Leibler divergence) between two probability
distributions over the same alphabet,
$P = \{P(x),\ x \in \mathcal{X}\}$ and $Q = \{Q(x),\ x \in \mathcal{X}\}$,

$$D(P\|Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)},$$



can never be negative, and a similar fact applies to probability 
density functions with the summation across X being replaced 
by integration. 
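
As a small numerical illustration of the information inequality, the following Python sketch evaluates the relative entropy for a finite alphabet. The function name `relative_entropy` and the two example distributions are assumptions made for illustration only; they do not appear in the paper.

```python
import math

def relative_entropy(p, q):
    """D(P||Q) in nats for two distributions over the same finite alphabet."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0.0:
            if qx == 0.0:
                return math.inf  # P puts mass where Q has none
            total += px * math.log(px / qx)
    return total

# Illustrative (assumed) distributions over a three-letter alphabet.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
d_pq = relative_entropy(p, q)   # strictly positive, since P differs from Q
d_pp = relative_entropy(p, p)   # zero, the equality case
```

The terms with P(x) = 0 are skipped, following the usual convention 0 log 0 = 0.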

The log-sum inequality (LSI) [1, Theorem 2.7.1, p. 31],
which asserts that for two sets of non-negative numbers,
$(a_1, a_2, \ldots, a_n)$ and $(b_1, b_2, \ldots, b_n)$:

$$\sum_{i=1}^n a_i \log\frac{a_i}{b_i} \ge \left(\sum_{i=1}^n a_i\right) \log\frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i},$$

is completely equivalent to the information inequality,¹ although
proved in [1] in a rather different manner.

Yet another name for the same inequality, which is more fre- 
quently encountered in the jargon of physicists, is the Gibbs' 
inequality: When the information inequality is applied to two 
probability distributions of the Boltzmann form (cf. Section 
IV below), it yields an interesting inequality concerning their 
corresponding free energies (cf. e.g., [2, Section 5.6, pp. 143-146]),
which serves as a useful tool for obtaining good bounds
on the free energy of a complex system, when its exact value 
is difficult to calculate. 

In this work, we provide a simple physical interpretation 
to this inequality of the free energies, and thereby also
to the information inequality, or the log-sum inequality. This 
physical interpretation is directly related to the second law 
of thermodynamics, which asserts that the entropy of an 
isolated physical system cannot decrease: According to this 
interpretation, the divergence between two probability distri- 
butions is proportional to the energy dissipated in the system 
when it undergoes an irreversible process, and hence converts 
this energy loss into entropy production, or heat. Thus, the 
non-negativity of the relative entropy is related to the non- 
negativity of this entropy change, which is, as said, the second 
law of thermodynamics. 

Since the mutual information can be thought of as an
instance of the relative entropy, and so can the difference
between two mutual informations defined along a Markov
chain, the data processing theorem (DPT) can, of course,
also be given the very same physical interpretation. Considering
the fact that the DPT is pivotal to most, if not all,
converse theorems in Information Theory, this means that,
in fact, the fundamental limits of Information Theory can,
at least conceptually, be attributed to the laws of physics,
in particular, to the second law of thermodynamics.² The
rate loss in any suboptimal coded communication system is
given the meaning of irreversibility and entropy production
in a corresponding physical system. Optimum (or nearly optimum)
communication systems correspond to reversible
processes (or lack of any process at all) with no entropy
production. Stated in somewhat different words, had there
been a communication system that violated a fundamental
limit (e.g., beating the entropy, or channel capacity), then in
principle, one could have constructed a physical system that
violates the second law, and vice versa.

1 The information inequality is obtained from the LSI when
$(a_1, a_2, \ldots, a_n)$ and $(b_1, b_2, \ldots, b_n)$ both sum to unity, and conversely,
the LSI is obtained from the information inequality, by applying the latter to
the probability distributions $P_i = a_i/\sum_j a_j$ and $Q_i = b_i/\sum_j b_j$.

2 Another law of physics that plays a role here, at least indirectly, is the
law of energy conservation, because our derivations are all based on the
Boltzmann-Gibbs distribution of equilibrium statistical mechanics, and this
distribution, in turn, is derived on the basis of the energy conservation law.

The outline of the remaining part of the paper is as follows.
In Section II, we give some basic background in
statistical physics. Section III reviews the role of the DPT
in many of the converse theorems of Shannon theory. In
Section IV, we offer a physical interpretation of the Gibbs'
inequality and show how it applies to the DPT in two different
scenarios. Finally, in Section V, we discuss relationships
between reversible processes in physics and error exponents
of classical Neyman-Pearson hypothesis testing.

II. Physics Background 

Consider a physical system with n particles, which at any
time instant, can be found in any one out of a variety of
microscopic states (or microstates, for short). The microstate
is defined by the full physical information about all n particles,
e.g., the positions, momenta, angular momenta, spins, etc.,
depending on the type of the physical system. In particular,
a microstate is designated by $x = (x_1, x_2, \ldots, x_n)$, where
each $x_i$ may itself be a vector, consisting of all the relevant
physical state variables (such as the above) for particle number
i at a given time instant. Associated with every microstate x,
there is an energy function, a.k.a. the Hamiltonian, $\mathcal{E}(x)$. For
example, in the case of the ideal gas, $x_i = (p_i, r_i)$, where
$p_i$ and $r_i$, both three dimensional vectors, are the momentum
and the position of particle number i, respectively, and

$$\mathcal{E}(x) = \sum_{i=1}^n \left[\frac{\|p_i\|^2}{2m} + mgz_i\right],$$ (1)

where m is the mass of each particle, g is the gravitation
constant, and $z_i$ is the height, one of the components of $r_i$.

One of the most fundamental results in statistical physics
(based on the law of energy conservation and the postulate
that all microstates of the same energy are equiprobable)
asserts that, when a system lies in thermal equilibrium with the
environment (heat bath), the probability of finding the system
at state x is given by the Boltzmann-Gibbs distribution

$$P(x) = \frac{e^{-\beta\mathcal{E}(x)}}{Z(\beta)},$$ (2)

where $\beta = 1/(kT)$, k being Boltzmann's constant and T being
temperature, and $Z(\beta)$ is the normalization constant, called the
partition function, which is given by

$$Z(\beta) = \sum_x e^{-\beta\mathcal{E}(x)} \quad\text{or}\quad Z(\beta) = \int dx\, e^{-\beta\mathcal{E}(x)},$$ (3)



depending on whether x is discrete or continuous. The partition
function is a key quantity from which many important
macroscopic physical quantities can be derived. For example,
the average internal energy w.r.t. eq. (2) is

$$E \triangleq \boldsymbol{E}\{\mathcal{E}(X)\} = -\frac{d\ln Z(\beta)}{d\beta},$$ (4)

the entropy (in units of k) pertaining to eq. (2) is

$$\Sigma(\beta) \triangleq \frac{S}{k} = -\boldsymbol{E}\{\ln P(X)\} = \ln Z(\beta) + \beta\cdot E,$$ (5)

and the free energy is given by

$$F = -kT\ln Z(\beta) = -\frac{\ln Z(\beta)}{\beta}.$$ (6)

From eq. (5), one readily obtains the well known relationship

$$F = E - ST.$$

Thus, any change in the internal energy, along a fixed temperature
(isothermal) process, is given by

$$\Delta E = \Delta F + T\Delta S,$$

in other words, it consists of two components: the first is the
change in the free energy, $\Delta F$, and the second pertains to
entropy production, $T\Delta S$. By the first law of thermodynamics,
which is actually the law of energy conservation,

$$\Delta E = \Delta Q + \Delta W,$$

namely, the origins of any change in the internal energy may
be a combination of the heat $\Delta Q$ transferred into the system
and the work $\Delta W$ applied to it. According to the thermodynamical
definition, the entropy difference, $\Delta S$, between two
macroscopic states A and B, is defined as $\int_A^B dQ/T$, where
the integration is along a quasi-static or reversible process,
i.e., a process that is slow enough such that, along the way,
the system is kept always very close to equilibrium. By the
Clausius theorem (cf. e.g., [2, p. 13]), in the above described
isothermal process, $\Delta S$ is never smaller than $\Delta Q/T$, with
equality when the process is reversible. Thus, by comparing
the two expressions of $\Delta E$, we immediately observe that
$\Delta W \ge \Delta F$.

The free energy is then given a meaning of crucial importance
in thermodynamics and statistical physics: The difference,
$\Delta F$, between the free energies associated with two
equilibrium points pertaining to the same temperature (but
with two different values of some other control parameter, such
as pressure or magnetic field) has the physical meaning of the
minimum amount of work that should be applied to the system
in order to transfer it between these two equilibria along an
isothermal process, and this minimum is attained when the
process is reversible.³ Equivalently, the negative free-energy
difference, $-\Delta F$, is the maximum amount of work that can be
extracted from the system in an isothermal process, and this
maximum is achieved, again, if the process is reversible. The
second law of thermodynamics, as mentioned earlier, asserts
that the entropy of an isolated system cannot decrease.

3 This fact is also known as the minimum work principle. 



III. The Data Processing Theorem and 
Fundamental Limits 

As mentioned earlier, our observations apply to any 
fundamental limit, or converse theorem, that makes use of the 
information inequality, in one way or another. However, even 
if we confine our attention only to those that use it explicitly
in the form of the DPT, it is not difficult to appreciate the 
fact that we already cover many of the fundamental limits, if 
not all of them. Here are just a few examples. 

Lossy/lossless source coding: Consider a source vector
$U^N = (U_1, \ldots, U_N)$ compressed into a bitstream
$X^n = (X_1, \ldots, X_n)$ from which the decoder generates
a reproduction $V^N = (V_1, \ldots, V_N)$ with distortion
$\sum_{i=1}^N \boldsymbol{E}\{d(U_i, V_i)\} \le ND$. Then, by the DPT,
$I(U^N; V^N) \le I(X^n; V^N) \le H(X^n)$, where $I(U^N; V^N)$ is
further lower bounded by $NR(D)$ and $H(X^n) \le n$, which
together lead to the converse to the lossy data compression
theorem, asserting that the compression ratio n/N cannot be
less than R(D). Lossless compression is obtained, of course,
as a special case where D = 0.

Channel coding under bit error probability: Let
$U^N = (U_1, \ldots, U_N)$ be drawn from the binary symmetric
source (BSS), designating $M = 2^N$ equiprobable messages
of length N. The encoder maps $U^N$ into a channel input
vector $X^n$, which in turn, is sent across the channel. The
receiver observes $Y^n$, a noisy version of $X^n$, and decodes
the message as $V^N$. Let $P_b = \frac{1}{N}\sum_{i=1}^N \Pr\{V_i \ne U_i\}$
designate the bit error probability. Then, by the DPT,
$I(U^N; V^N) \le I(X^n; Y^n)$, where $I(X^n; Y^n)$ is further
upper bounded by nC, C being the channel capacity,
and $I(U^N; V^N) = H(U^N) - H(U^N|V^N) \ge N - \sum_{i=1}^N H(U_i|V_i)
\ge N - \sum_{i=1}^N h_2(\Pr\{V_i \ne U_i\}) \ge N[1 - h_2(P_b)]$,
where $h_2(\cdot)$ is the binary entropy function.
Thus, for $P_b$ to vanish, the coding rate, N/n, should not
exceed C.
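
The two bounds above combine into $R[1 - h_2(P_b)] \le C$ with R = N/n, which can be inverted numerically. The sketch below is an illustration under assumed numbers: the helper names `h2` and `min_bit_error`, the bisection search, and the example rates are not from the paper, and C is taken in bits per channel use.

```python
import math

def h2(p):
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def min_bit_error(rate, capacity, tol=1e-12):
    """Smallest P_b in [0, 1/2] satisfying rate * (1 - h2(P_b)) <= capacity."""
    if rate <= capacity:
        return 0.0                      # the converse imposes no restriction
    target = 1.0 - capacity / rate      # need h2(P_b) >= target
    lo, hi = 0.0, 0.5                   # h2 is increasing on [0, 1/2]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h2(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi

pb_low_rate = min_bit_error(0.4, 0.5)   # R < C: bound is trivial
pb_high_rate = min_bit_error(1.0, 0.5)  # R = 2C: P_b is bounded away from zero
```

For R above C, the returned value is the converse's lower bound on the bit error probability; it vanishes only when R does not exceed C.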



Channel coding under block error probability - Fano's
inequality: This is the same as in the previous item, except
that the error performance is the block error probability
$P_B = \Pr\{V^N \ne U^N\}$. This time, $H(U^N|V^N)$, which is
identical to $H(U^N, E|V^N)$, with $E = \mathcal{I}\{V^N \ne U^N\}$
($\mathcal{I}\{\cdot\}$ being the indicator function), is decomposed as
$H(E|V^N) + H(U^N|V^N, E)$, where the first term is
upper bounded by 1 and the second term is upper bounded by
$P_B\log(2^N - 1) \le NP_B$, owing to the fact that the maximum
of $H(U^N|V^N, E = 1)$ is obtained when $U^N$ is distributed
uniformly over all $u^N \ne V^N$. Putting these facts all together
(that is, $N - 1 - NP_B \le nC$), we obtain Fano's inequality
$P_B \ge 1 - 1/N - C/R$, where
R = N/n is the coding rate. Thus, the DPT directly supports
Fano's inequality, which in turn is the main tool for proving
converses to channel coding theorems in a large variety of
communication situations, including network configurations.

Joint source-channel coding and the separation principle:
In a joint source-channel situation, where the source vector
$U^N$ is mapped into a channel input vector $X^n$ and the
channel output vector $Y^n$ is decoded into a reconstruction
$V^N$, the DPT gives rise to the chain of inequalities
$NR(D) \le I(U^N; V^N) \le I(X^n; Y^n) \le nC$, which is the
converse to the joint source-channel coding theorem, whose
direct part can be achieved by separate source and channel
coding. The first two examples above are special cases of this.

Conditioning reduces entropy: Perhaps even more often than
the term "data processing theorem" can be found as part of a
proof of a converse theorem, one encounters an equivalent of
this theorem under the slogan "conditioning reduces entropy".
This in turn is part of virtually every converse proof in the
literature. Indeed, if (X, U, V) is a triple of RV's, then this
statement means that $H(X|V) \ge H(X|U, V)$. If, in addition,
$X \to U \to V$ is a Markov chain, then $H(X|U, V) = H(X|U)$,
and so, $H(X|V) \ge H(X|U)$, which in turn is
equivalent to the more customary form of the DPT,
$I(X; U) \ge I(X; V)$, obtained by subtracting both sides of the
entropy inequality from H(X). In fact, as we shall see shortly, it is this
entropy inequality that lends itself more naturally to a physical
interpretation. Moreover, we can think of the
conditioning-reduces-entropy inequality as another form of the DPT even in
the absence of the aforementioned Markov condition, because
$X \to (U, V) \to V$ is always a Markov chain.

IV. Physics of the Information Inequality & DPT 

We consider two forms of the information inequality and the
DPT, one corresponding to an isothermal process and the other
to an adiabatic process (fixed amount of heat).

A. Isothermal Version 

Consider a system, with a microstate x, which may have two
possible Hamiltonians, $\mathcal{E}_0(x)$ and $\mathcal{E}_1(x)$. Let $Z_i(\beta)$ denote
the partition function pertaining to $\mathcal{E}_i(\cdot)$, that is,

$$Z_i(\beta) = \sum_x e^{-\beta\mathcal{E}_i(x)}, \quad i = 0, 1,$$

where $\beta = 1/(kT)$ is the inverse
temperature. Since $\beta$ is fixed throughout this section, we will
also use the shorthand notation $Z_i$ for the partition function.
Let $P_i(x)$ denote the Boltzmann-Gibbs distribution (cf. eq.
(2)) pertaining to $\mathcal{E}_i(x)$, i = 0, 1 (both for the same given value
of $\beta$). Applying the information inequality to $P_0$ and $P_1$, we
get:

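$$0 \le D(P_0\|P_1) = \sum_x P_0(x)\ln\frac{e^{-\beta\mathcal{E}_0(x)}/Z_0}{e^{-\beta\mathcal{E}_1(x)}/Z_1}
= \ln Z_1 - \ln Z_0 + \beta\boldsymbol{E}_0\{\mathcal{E}_1(X) - \mathcal{E}_0(X)\},$$ (7)

where $\boldsymbol{E}_0\{\cdot\}$ denotes the expectation operator w.r.t. $P_0$. After
a minor algebraic rearrangement, this becomes:

$$\boldsymbol{E}_0\{\mathcal{E}_1(X) - \mathcal{E}_0(X)\} \ge kT\ln Z_0 - kT\ln Z_1 = F_1 - F_0,$$ (8)

where $F_i$ is the free energy pertaining to $P_i$, i = 0, 1 (cf. eq.
(6)).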

We now offer the following physical interpretation to this
inequality: Imagine that a system with Hamiltonian $\mathcal{E}_0(x)$ is
in equilibrium for all t < 0, but then, at time t = 0, the
Hamiltonian changes abruptly from $\mathcal{E}_0(x)$ to $\mathcal{E}_1(x)$ (e.g.,
by suddenly applying a force, like pressure or a magnetic
field, to the system), which means that if the system is found
at state x at time t = 0, additional energy of
$W(x) = \mathcal{E}_1(x) - \mathcal{E}_0(x)$ is suddenly 'injected' into it. This additional
energy can be thought of as work performed on the system,
or as supplementary potential energy. Of course, W(x) is
a random variable due to the randomness of x. Since this
passage between $\mathcal{E}_0$ and $\mathcal{E}_1$ is abrupt, and the microstate x
does not change instantaneously, the expectation of W(X)
should be taken w.r.t. $P_0$, and this average is exactly what
we have at the left-hand side of eq. (8). The Gibbs' inequality
tells us then that this average work is at least as large as
$\Delta F = F_1 - F_0$, the increase in free energy, in compliance with
the explanation in Section II. The difference

$$\boldsymbol{E}_0\{W(X)\} - \Delta F = kT\cdot D(P_0\|P_1) \ge 0$$

is due to the irreversible nature of this abrupt energy
injection, and this irreversibility means an increase of the
total entropy of the system and its environment.⁵ Thus, the
Gibbs' inequality is, in fact, a version of the second law of
thermodynamics, and the relative entropy is given a very
simple physical significance. We next consider two examples.
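
The identity between dissipated work and relative entropy is easy to check numerically on a toy system. In the sketch below, the eight-state alphabet, the randomly drawn Hamiltonians, and the value of kT are all assumptions chosen for illustration; only the identity being checked comes from the text.

```python
import math
import random

def boltzmann(energies, beta):
    """Return (distribution, partition function) for a finite state space."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights], z

random.seed(0)
k_t = 1.5                                            # kT, arbitrary energy units (assumed)
beta = 1.0 / k_t
e0 = [random.uniform(0.0, 3.0) for _ in range(8)]    # toy Hamiltonian E_0 (assumed)
e1 = [random.uniform(0.0, 3.0) for _ in range(8)]    # toy Hamiltonian E_1 (assumed)

p0, z0 = boltzmann(e0, beta)
p1, z1 = boltzmann(e1, beta)

avg_work = sum(p * (b - a) for p, a, b in zip(p0, e0, e1))     # E_0{W(X)}
delta_f = k_t * math.log(z0) - k_t * math.log(z1)              # F_1 - F_0
divergence = sum(p * math.log(p / q) for p, q in zip(p0, p1))  # D(P_0||P_1)
dissipated = avg_work - delta_f                                # should equal kT * D
```

Up to floating-point error, `dissipated` coincides with `k_t * divergence`, so the excess of average work over the free-energy difference is non-negative for any choice of the two Hamiltonians.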

Example 1 - Fixed-to-variable compression and the Ising
model. A natural information-theoretic example for this can
be easily motivated by the interpretation of the relative entropy
as the rate loss (or, the redundancy) due to mismatch
in fixed-to-variable lossless data compression: Suppose that
$X \in \{-1,+1\}^n$ emerges from a first-order Markov source
$P_0(x) = \prod_{i=1}^n P_0(x_i|x_{i-1})$, where

$$P_0(x|x') = \frac{\exp\{Jx\cdot x'\}}{Z_0}, \quad x, x' \in \{-1,+1\},$$

and where J is a given constant and

$$Z_0 = 2\cosh(J).$$

However, the code designer designs a Shannon code according
to $P_1(x) = \prod_{i=1}^n P_1(x_i|x_{i-1})$, where

$$P_1(x|x') = \frac{\exp\{Jx\cdot x' + Kx\}}{\zeta(x')}, \quad x, x' \in \{-1,+1\},$$

where K is another given constant and $\zeta(x)$ is the appropriate
normalization factor given by

$$\zeta(x) = \begin{cases} 2\cosh(J + K), & x = +1 \\ 2\cosh(J - K), & x = -1 \end{cases}$$



4 Since the information inequality applies to any pair of distributions, it is
conceivable that the interpretation we offer may remain relevant even beyond
the realm of systems in equilibrium. Indeed, even if the system is away
from equilibrium, when it is nevertheless in steady state (in the sense that
macroscopic physical quantities are time-invariant), the negative logarithm of
the density function can be given the meaning of an effective Hamiltonian
[3]. This, however, is beyond the scope of this work.

5 See also [4], [5], [6] and references therein, where the same conclusions
are reached from a more general perspective of irreversible processes, but
under certain limiting assumptions on the physical system.



Considering the fact that $x \in \{-1,+1\}$, $\zeta(x)$ can also be
written in a unified way as

$$\zeta(x) = Z_1\cdot\left[\frac{\cosh(J + K)}{\cosh(J - K)}\right]^{x/2},$$

where

$$Z_1 = 2\sqrt{\cosh(J + K)\cosh(J - K)}.$$

From the physics point of view, both $P_0$ and $P_1$ can be thought
of as Boltzmann-Gibbs distributions with inverse temperature
$\beta = 1$: For the former, we define the Hamiltonian as

$$\mathcal{E}_0(x) \triangleq -n\ln Z_0 - \sum_{i=1}^n \ln P_0(x_i|x_{i-1}) = -J\sum_{i=1}^n x_{i-1}x_i,$$ (9)

which can be thought of as the energy pertaining to nearest-neighbor
interactions between spins in a one-dimensional
array, that is, the one-dimensional Ising model (see, e.g., [7,
Sect. 1.8]) with a coupling coefficient J, in the absence of a
magnetic field. On the other hand, for $P_1$ we define:

$$\begin{aligned}
\mathcal{E}_1(x) &\triangleq -n\ln Z_1 - \sum_{i=1}^n \ln P_1(x_i|x_{i-1}) \\
&= -J\sum_i x_{i-1}x_i - K\sum_i x_i - \frac{1}{2}\ln\left[\frac{\cosh(J - K)}{\cosh(J + K)}\right]\sum_i x_{i-1} \\
&\approx -J\sum_i x_{i-1}x_i - \left[K + \frac{1}{2}\ln\frac{\cosh(J - K)}{\cosh(J + K)}\right]\sum_i x_i \\
&\triangleq -J\sum_i x_{i-1}x_i - B\sum_i x_i,
\end{aligned}$$ (10)

where in the approximate equality we neglected "edge effects"
that make the (relatively) small difference between $\sum_i x_i$ and
$\sum_i x_{i-1}$ (for large n). This is the same Ising model as before,
but now also with a magnetic field B. Thus,

$$\mathcal{E}_1(x) - \mathcal{E}_0(x) = -B\sum_i x_i$$

is the energy injected by an abrupt application of the magnetic
field B. We have therefore demonstrated that the entropy
production due to the irreversibility of this abrupt magnetic
field is (within the additive constant, $\Delta F = 1\cdot(\ln Z_0 - \ln Z_1)$)
proportional to the redundancy of the mismatched code.
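
The size of the neglected edge effect can be checked by brute force for a short chain. The following sketch enumerates all spin sequences (including an initial spin $x_0$) and compares the exact injected energy with $-B\sum_i x_i$. The values of J, K and n are assumed toy values, and the helper names are hypothetical.

```python
import itertools
import math

J, K, n = 0.7, 0.3, 6                      # toy parameters (assumed)
c_plus, c_minus = math.cosh(J + K), math.cosh(J - K)
Z0 = 2.0 * math.cosh(J)
Z1 = 2.0 * math.sqrt(c_plus * c_minus)
B = K + 0.5 * math.log(c_minus / c_plus)   # effective magnetic field

def p0(x, xp):
    """Source transition probability P_0(x | x')."""
    return math.exp(J * x * xp) / Z0

def p1(x, xp):
    """Mismatched transition probability P_1(x | x'); zeta(x') = 2 cosh(J x' + K)."""
    return math.exp(J * x * xp + K * x) / (2.0 * math.cosh(J * xp + K))

def energy(cond, z, x):
    """-n ln z - sum_i ln cond(x_i | x_{i-1}), with x[0] the initial spin."""
    return -n * math.log(z) - sum(
        math.log(cond(x[i], x[i - 1])) for i in range(1, n + 1))

max_edge = 0.0
for x in itertools.product((-1, 1), repeat=n + 1):
    injected = energy(p1, Z1, x) - energy(p0, Z0, x)   # exact E_1(x) - E_0(x)
    field_term = -B * sum(x[1:])                       # approximation of eq. (10)
    max_edge = max(max_edge, abs(injected - field_term))

edge_bound = abs(math.log(c_minus / c_plus))           # 2 * |B - K|
```

The discrepancy never exceeds `edge_bound`, a constant independent of n, which is why it is negligible relative to sums that grow linearly with n.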

Example 2 - Run-length coding and the grand-canonical
ensemble. The Boltzmann-Gibbs distribution of eq. (2), a.k.a.
the canonical distribution, is the equilibrium distribution of
a system that is allowed to exchange heat energy with its
environment at a fixed temperature T. It also assumes that the
system has a fixed number of particles n, and a fixed volume
V, whenever the volume is a relevant factor.

When the system is allowed to exchange with the environment,
not only energy, but also matter, namely, particles,
then eq. (2) is extended to the grand-canonical distribution
[2, Sect. 4.9], whose microstate is defined as (x, n), where n
is now a random variable, and x is defined as before for the
given n. According to this distribution,

$$P(x, n) = \frac{e^{\beta[\mu n - \mathcal{E}(x)]}}{\Xi(\beta, \mu)},$$

where

$$\Xi(\beta, \mu) = \sum_{n \ge 0} e^{\beta\mu n}\sum_x e^{-\beta\mathcal{E}(x)} = \sum_{n \ge 0} e^{\beta\mu n}Z(\beta, n)$$

is the grand partition function. The parameter $\mu$, which is
called the chemical potential, controls the average number of
particles in the system. Note that P(x, n) can be thought of as
$P(n)\cdot P(x|n)$, where P(x|n) obeys the canonical distribution
for the given n and P(n) is proportional to $e^{\beta\mu n}Z(\beta, n)$.
It is well known (see, e.g., [2]) that $kT\ln\Xi(\beta, \mu)$ gives the
equilibrium pressure-volume product of the system, PV. Now
let $P_0(x, n)$ and $P_1(x, n)$ be two grand-canonical distributions
that differ only in the chemical potentials, $\mu_i$, i = 0, 1,
respectively. Applying the information inequality, we get



$$0 \le D(P_0\|P_1) = \ln\Xi(\beta, \mu_1) - \ln\Xi(\beta, \mu_0) - \beta(\mu_1 - \mu_0)\boldsymbol{E}_0\{N\},$$ (11)

where N designates the random number of particles. Dividing
by $\beta$ and rearranging terms, this becomes:

$$P_1V \ge P_0V + (\mu_1 - \mu_0)\boldsymbol{E}_0\{N\},$$

and after dividing by V (which is assumed fixed), we get:

$$P_1 \ge P_0 + (\mu_1 - \mu_0)\boldsymbol{E}_0\{\rho\},$$

where $\rho = N/V$ is the density of particles.

A natural information-theoretic analogue of this is run-length
coding: Given a 0-1 binary memoryless source with
a very high probability of '0', which we shall designate by
$e^{\mu}$ ($\mu < 0$, $\beta = 1$), the idea is to encode the number N
of successive zeroes between every two consecutive ones.
Clearly, the distribution of N is exponential (i.e., geometric),

$$\Pr\{N = n\} = \frac{e^{\mu n}}{\Xi(\mu)}, \quad n = 0, 1, 2, \ldots,$$

where, with a slight abuse of notation, we define

$$\Xi(\mu) = \frac{1}{1 - e^{\mu}},$$

and where we have assumed $\mathcal{E}(x) = -\ln P(x|n)$, and so,
Z(1, n) = 1 for all n. Thus, when applying run-length
coding, the price of mismatch in $\mu$ is parallel to the difference
between the two sides of the above pressure inequality,
where the 'pressure' in run-length coding is proportional to
$-\ln(1 - e^{\mu})$. As $\mu \uparrow 0$, the pressure increases, and more
'particles' (i.e., runs of zeroes) enter into the system, which
means that the run-lengths become larger. Thus, we have
demonstrated an analogy between run-length coding and the
physics of the grand-canonical ensemble: the log-probability
of '0' plays the role of the chemical potential, whereas the
log-probability of '1' is associated with pressure. This
concludes Example 2.
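
The correspondence between the mismatch penalty and the gap in the pressure inequality can be verified numerically. In this sketch (with $\beta = 1$), the probabilities 0.9 and 0.8 for the symbol '0', the truncation length `n_max`, and the function names are assumptions for illustration.

```python
import math

def log_xi(mu):
    """ln Xi(mu) = -ln(1 - e^mu), the 'pressure' term, valid for mu < 0."""
    return -math.log(1.0 - math.exp(mu))

def mean_run(mu):
    """E{N} under Pr{N = n} = e^(mu n) / Xi(mu), n = 0, 1, 2, ..."""
    return math.exp(mu) / (1.0 - math.exp(mu))

def divergence(mu0, mu1, n_max=5000):
    """D(P_0||P_1) by direct (truncated) summation over run lengths."""
    total = 0.0
    for n in range(n_max):
        log_p0 = mu0 * n - log_xi(mu0)
        log_p1 = mu1 * n - log_xi(mu1)
        total += math.exp(log_p0) * (log_p0 - log_p1)
    return total

# Assumed example: true Pr{'0'} = 0.9, code designed for Pr{'0'} = 0.8.
mu0, mu1 = math.log(0.9), math.log(0.8)
gap = log_xi(mu1) - log_xi(mu0) - (mu1 - mu0) * mean_run(mu0)
direct = divergence(mu0, mu1)
```

Here `gap` is the difference between the two sides of the pressure inequality (in units where $\beta = 1$ and V = 1), and it agrees with the directly summed relative entropy between the two run-length distributions.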

Returning to the general framework, let us now see how
the Gibbs' inequality is related to the DPT. Consider a triple
of random variables (X, U, V) which form a Markov chain
$X \to U \to V$. The DPT asserts that $I(X; U) \ge I(X; V)$. We
can obtain the DPT as a special case of the Gibbs' inequality
because

$$I(X; U) - I(X; V) = H(X|V) - H(X|U)
= \boldsymbol{E}\{D(P_{X|U,V}(\cdot|U, V)\|P_{X|V}(\cdot|V))\},$$

where the expectation is w.r.t. the randomness of (U, V).
Thus, for a given realization (u, v) of (U, V), consider the
Hamiltonians $\mathcal{E}_0(x) = -\ln P(x|u) = -\ln P(x|u, v)$ and
$\mathcal{E}_1(x) = -\ln P(x|v)$, pertaining to a single 'particle' whose
state is x. Let us also set $\beta = 1$. Thus, for a given (u, v):

$$\boldsymbol{E}_0\{W(X)\} = \sum_x P(x|u, v)[\ln P(x|u) - \ln P(x|v)]
= H(X|V = v) - H(X|U = u)$$ (12)

and after further averaging w.r.t. (U, V), the average work becomes
$H(X|V) - H(X|U) = I(X; U) - I(X; V)$. Concerning
the free energies, we have

$$Z_0(1) = \sum_x \exp\{-1\cdot[-\ln P(x|u, v)]\} = \sum_x P(x|u, v) = 1$$ (13)

and similarly,

$$Z_1(1) = \sum_x P(x|v) = 1,$$

which means that $F_0(1) = F_1(1) = 0$, and so $\Delta F = 0$
as well. So by the Gibbs' inequality, the average work,
$I(X; U) - I(X; V)$, cannot be smaller than the free-energy
difference, which in this case vanishes, namely,
$I(X; U) - I(X; V) \ge 0$, which is the DPT. Note that in this case,
there is a maximum degree of irreversibility: The identity
$I(X; U) - I(X; V) = H(X|V) - H(X|U)$ means that the whole
average work, $W = I(X; U) - I(X; V)$, goes for entropy
increase $T\Delta\Sigma = 1\cdot[H(X|V) - H(X|U)]$, whereas the free
energy remains unchanged, as mentioned earlier. Moreover,
the entire entropy increase goes to the system under discussion,
and none of it goes to the environment.
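
The representation of the mutual-information gap as an average relative entropy can be checked on a randomly generated Markov chain. The alphabet sizes, the random transition matrices, and the helper names in the sketch below are assumptions made purely for illustration.

```python
import math
import random

def normalize(row):
    s = sum(row)
    return [r / s for r in row]

def mutual_information(joint):
    """I(A;B) in nats from a joint pmf given as a nested list joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(pab * math.log(pab / (pa[a] * pb[b]))
               for a, row in enumerate(joint)
               for b, pab in enumerate(row) if pab > 0.0)

random.seed(1)
nx, nu, nv = 3, 4, 3                       # alphabet sizes (assumed)
px = normalize([random.random() for _ in range(nx)])
pu_x = [normalize([random.random() for _ in range(nu)]) for _ in range(nx)]
pv_u = [normalize([random.random() for _ in range(nv)]) for _ in range(nu)]

# Joint pmfs implied by the Markov chain X -> U -> V.
pxu = [[px[x] * pu_x[x][u] for u in range(nu)] for x in range(nx)]
pxv = [[sum(pxu[x][u] * pv_u[u][v] for u in range(nu)) for v in range(nv)]
       for x in range(nx)]
pu = [sum(pxu[x][u] for x in range(nx)) for u in range(nu)]
pv = [sum(pxv[x][v] for x in range(nx)) for v in range(nv)]

# E{ D( P(.|U,V) || P(.|V) ) }, using P(x|u,v) = P(x|u) for the Markov chain.
avg_div = 0.0
for x in range(nx):
    for u in range(nu):
        for v in range(nv):
            p_xuv = pxu[x][u] * pv_u[u][v]
            avg_div += p_xuv * math.log((pxu[x][u] / pu[u]) / (pxv[x][v] / pv[v]))

gap = mutual_information(pxu) - mutual_information(pxv)   # I(X;U) - I(X;V)
```

The two quantities `gap` and `avg_div` agree to machine precision, and both are non-negative, which is the DPT in this toy instance.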

At this point a comment is in order: The rate loss of a 
suboptimal communication system, when viewed from the 
DPT perspective, may be attributed to two possible factors: 
one factor comes from a possible mismatch between actual 
distributions and optimum distributions in the information- 
theoretic sense, for example, the encoder may not induce 
the capacity-achieving channel input distribution or the test 
channel of the rate-distortion function. The other factor is a 



possible gap between mutual informations along the Markov 
chain (I(X; U) may be strictly larger than I(X; V)), which 
actually means information loss, and which is irreversible (U 
cannot be retrieved from V). It is the latter kind of loss that
is parallel to the irreversible free energy loss and dissipation. 

From a more general physical perspective, we can think of
the Hamiltonian

$$\mathcal{E}_\lambda(x) = \mathcal{E}_0(x) + \lambda[\mathcal{E}_1(x) - \mathcal{E}_0(x)]$$

as a linear interpolation between the two extremes, $\lambda = 0$
and $\lambda = 1$, pertaining to $\mathcal{E}_0$ and $\mathcal{E}_1$, and then $\lambda$ can be
thought of as a control parameter or a 'force' that influences
the system. The Jarzynski equality (cf. e.g., [4] and references
therein) tells us that under certain conditions on the system and
the environment, and given any protocol for a temporal change
in $\lambda$, designated by $\{\lambda_t\}$, for which $\lambda_t = 0$ for all $t < 0$, and
$\lambda_t = 1$ for all $t \ge \tau$ ($\tau \ge 0$), the work W applied to the
system is a RV that satisfies

$$\boldsymbol{E}\{e^{-\beta W}\} = e^{-\beta\Delta F}.$$

By Jensen's inequality,

$$\boldsymbol{E}\{e^{-\beta W}\} \ge \exp(-\beta\boldsymbol{E}\{W\}),$$

which then gives $\boldsymbol{E}\{W\} \ge \Delta F$, for an arbitrary protocol
$\{\lambda_t\}$. The Gibbs' inequality is then a special case, where $\lambda_t$
is given by the unit step function, but it applies regardless
of the assumptions of [4]. At the other extreme, when $\lambda_t$
changes very slowly, corresponding to a reversible process, W
approaches determinism, and then Jensen's inequality becomes
tight. In the limit of an arbitrarily slow process, this yields
$W = \Delta F$, with no increase in entropy.
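
For the unit-step protocol, the Jarzynski equality reduces to an exact identity that can be verified directly, since W(x) is then a deterministic function of the initial microstate. The sketch below does so on a toy ten-state system; the random Hamiltonians and the value of $\beta$ are assumptions for illustration only.

```python
import math
import random

random.seed(2)
beta = 0.8                                            # inverse temperature (assumed)
e0 = [random.uniform(0.0, 2.0) for _ in range(10)]    # toy Hamiltonian E_0 (assumed)
e1 = [random.uniform(0.0, 2.0) for _ in range(10)]    # toy Hamiltonian E_1 (assumed)

z0 = sum(math.exp(-beta * e) for e in e0)
z1 = sum(math.exp(-beta * e) for e in e1)
p0 = [math.exp(-beta * e) / z0 for e in e0]           # equilibrium before the step
delta_f = -(math.log(z1) - math.log(z0)) / beta       # F_1 - F_0

work = [b - a for a, b in zip(e0, e1)]                # W(x) for a unit-step protocol
exp_avg = sum(p * math.exp(-beta * w) for p, w in zip(p0, work))   # E{exp(-beta W)}
mean_work = sum(p * w for p, w in zip(p0, work))                   # E{W}
```

Here `exp_avg` equals `exp(-beta * delta_f)` because the sum collapses to $Z_1/Z_0$, and Jensen's inequality then forces `mean_work` to be at least `delta_f`.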

B. Adiabatic Version 

Thus far, we discussed an isothermal process, where the
change was attributed to the Hamiltonian, a transition from
$\mathcal{E}_0$ to $\mathcal{E}_1$. In the special case where the two Hamiltonians are
proportional to one another, namely, when $\mathcal{E}_1(x)/\mathcal{E}_0(x) =$
const., independent of x, one can, of course, still consider it as
an isothermal process and refer the change in the Hamiltonian
to that of a multiplicative control parameter $\lambda$, as before (e.g.,
a harmonic potential proportional to $x^2$). But perhaps even more natural,
in this case, is to refer the change to temperature. In this
case, there is no external mechanical work, and the change
in the internal energy of the system comes solely from heat:
We replace a heat bath (large environment) with temperature
$T_0 = 1/(k\beta_0)$ by a heat bath with a higher temperature
$T_1 = 1/(k\beta_1)$. If we apply the Gibbs' inequality to this special
case, this amounts to

$$\ln Z(\beta_1) \ge \ln Z(\beta_0) + (\beta_0 - \beta_1)\boldsymbol{E}_0\{\mathcal{E}_0(X)\},$$

which is easily shown (cf. eq. (5)) to be equivalent to

$$\Delta\Sigma = \Sigma(\beta_1) - \Sigma(\beta_0)
\ge \beta_1\left[\boldsymbol{E}_1\{\mathcal{E}_0(X)\} - \boldsymbol{E}_0\{\mathcal{E}_0(X)\}\right] = \frac{\Delta Q}{kT_1},$$ (14)

where $\boldsymbol{E}_i\{\cdot\}$ denotes expectation w.r.t. the Boltzmann-Gibbs
distribution at $\beta_i$, $\Sigma(\beta_0)$ and $\Sigma(\beta_1)$ are the equilibrium entropies (in units
of k) pertaining to $\beta_0$ and $\beta_1$, respectively, and $\Delta Q$ is the
amount of heat injected into the system, assuming there is
no mechanical work. This inequality is a special case of the
Clausius theorem (mentioned earlier), which in its general
form, asserts that $\Delta S = k\Delta\Sigma$ is never smaller than $\int dQ/T$
for any process, with equality in the case of a reversible
process. The expression $\Delta Q/T_1$ is the result of this integral
when the heat bath of temperature $T_0$ is abruptly replaced by
one with temperature $T_1$. An alternative interpretation of this
inequality is, again, as an instance of the second law: The
entropy of our system increases by $\Delta S$ and the entropy of the
(new) heat bath decreases by $\Delta Q/T_1$, thus the net entropy
change of the combined system (which is assumed isolated),
$\Delta S - \Delta Q/T_1$, must be non-negative.
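
Inequality (14) can be checked numerically for any finite energy spectrum. In the sketch below, the twelve-level random spectrum, the two inverse temperatures, and the helper name `equilibrium` are assumptions chosen for illustration.

```python
import math
import random

def equilibrium(energies, beta):
    """Return (ln Z, mean energy, entropy in units of k) at inverse temperature beta."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    mean_e = sum(w * e for w, e in zip(weights, energies)) / z
    return math.log(z), mean_e, math.log(z) + beta * mean_e

random.seed(3)
energies = [random.uniform(0.0, 5.0) for _ in range(12)]   # toy spectrum (assumed)
beta0, beta1 = 2.0, 0.5                                    # beta1 < beta0, i.e. T_1 > T_0

log_z0, mean_e0, sigma0 = equilibrium(energies, beta0)
log_z1, mean_e1, sigma1 = equilibrium(energies, beta1)

delta_sigma = sigma1 - sigma0                  # entropy increase, in units of k
heat_term = beta1 * (mean_e1 - mean_e0)        # Delta Q / (k T_1)
net_entropy = delta_sigma - heat_term          # net change of system plus new bath

# Relative entropy between the two Boltzmann-Gibbs distributions, for comparison.
kl = sum(math.exp(-beta0 * e - log_z0)
         * ((beta1 - beta0) * e + log_z1 - log_z0) for e in energies)
```

The net entropy change `net_entropy` coincides with the relative entropy `kl` between the distributions at $\beta_0$ and $\beta_1$, so it cannot be negative.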

In the information-theoretic context, the relevant situation
is one where $P(x|u, v) = P(x|u)$ and
$P(x|v) = \int du\,P(x|u, v)P(u|v)$ can be represented as Boltzmann
distributions with the same Hamiltonian, but which may differ in
temperature and possibly in shifts (by u or v). I.e.,

$$P(x|u, v) = P(x|u) = \frac{e^{-\beta_0\mathcal{E}(x - u)}}{Z(\beta_0)}, \qquad
P(x|v) = \frac{e^{-\beta_1\mathcal{E}(x - v)}}{Z(\beta_1)}, \qquad \beta_1 < \beta_0.$$



This turns out to be the case when X, U and V are related 
by a cascade of two additive channels of the same family 
(e.g., a degraded broadcast channel), one from V to U and 
the other from U to X (or in the other direction). Two clas- 
sical examples are those when both channels are binary and 
symmetric (with possibly two different crossover parameters), 
and when they are both Gaussian (with possibly different noise 
variances). Other examples of these properties could pertain to 
any choice of an infinitely divisible random variable as a noise 
model in both channels, like the Poisson RV, the binomial RV, 
and so on. 

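Using again the Gibbs' inequality as before, we now get,
for given u and v:

$$\ln Z(\beta_1) \ge \ln Z(\beta_0) + \beta_0\boldsymbol{E}_{\beta_0,u,v}\{\mathcal{E}(X - u)\}
- \beta_1\boldsymbol{E}_{\beta_0,u,v}\{\mathcal{E}(X - v)\},$$ (15)

where $\boldsymbol{E}_{\beta_0,u,v}$ denotes expectation w.r.t. $P(x|u, v)$ as defined
above. Now, assuming shift-invariance of integrals over x
(as is the case in the BSC and Gaussian examples mentioned
above), $\boldsymbol{E}_{\beta_0,u,v}\{\mathcal{E}(X - u)\} = \boldsymbol{E}_{\beta_0,0,0}\{\mathcal{E}(X)\} \equiv
\boldsymbol{E}_{\beta_0}\{\mathcal{E}(X)\}$, independently of u and v. As for the third term,
from the above relation between $P(x|u, v)$ and $P(x|v)$, it is
apparent that after averaging $\boldsymbol{E}_{\beta_0,u,v}\{\mathcal{E}(X - v)\}$ (where the
function $\mathcal{E}(x - v)$ is independent of u) w.r.t. $P(u|v)$, it becomes
$\boldsymbol{E}_{\beta_1,v}\{\mathcal{E}(X - v)\} = \boldsymbol{E}_{\beta_1,0}\{\mathcal{E}(X)\} \equiv \boldsymbol{E}_{\beta_1}\{\mathcal{E}(X)\}$.
Thus, we get

$$\Sigma(\beta_1) = \ln Z(\beta_1) + \beta_1\boldsymbol{E}_{\beta_1}\{\mathcal{E}(X)\}
\ge \ln Z(\beta_0) + \beta_0\boldsymbol{E}_{\beta_0}\{\mathcal{E}(X)\} = \Sigma(\beta_0).$$ (16)

This is then a special case of the inequality $\Delta\Sigma \ge \Delta Q/(kT_1)$,
where $\Delta Q = 0$, namely, an adiabatic process, and then
$\Delta\Sigma \ge 0$, or $\Delta S \ge 0$. The information loss due to the DPT again has
the physical interpretation of entropy increase, but this time it
is purely due to temperature increase, rather than the dissipated
work that we have seen before.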

We end this section with two simple examples, namely, 
the Gaussian broadcast channel and the binary symmetric 
broadcast channel. In both examples, we view the mutual 
information difference, which is the entropy increase, as an 
integral of temperature, and thereby identify the corresponding 
heat capacity from the integrand. 

Example 3 - Gaussian degraded broadcast channel: Consider 
a Gaussian degraded broadcast channel, i.e., a cascade of two 
independent additive white Gaussian noise (AWGN) channels, 
given by: 

Ni 



X = U 



and 



U = V + N 2 



where Ni and N 2 are both zero-mean, unit-variance Gaussian 
RV's, independent of each other as well as of V, which in turn 
has an arbitrary density with E{V 2 } < oo. In this case, 



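$$\begin{aligned}
I(X;U) - I(X;V) &= h(X|V) - h(X|U) \\
&= \frac{1}{2}\ln\frac{2\pi e}{\beta_1} - \frac{1}{2}\ln\frac{2\pi e}{\beta_0} \\
&= \frac{1}{2}\int_{\beta_1}^{\beta_0}\frac{d\beta}{\beta} \\
&= \int_{T_0}^{T_1}\frac{dT}{2T},
\end{aligned}$$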



where in the last step, we changed the integration variable
from $\beta$ to $T = 1/(\beta k)$. As mentioned in Section II, in the
thermodynamical definition, an entropy change is given by

$$\Delta S = k\Delta\Sigma = \int\frac{dQ}{T}$$



along a reversible process, but dQ = C(T)dT, where C(T) 
is the heat capacity (at constant volume), and so, 



$$\Delta S = \int_{T_0}^{T_1}\frac{dT\,C(T)}{T}.$$



Thus, we identify the heat capacity pertaining to the Gaussian
broadcast channel as $C(T) = k/2$, independently of $T$, which
is exactly the same as the heat capacity (per degree of freedom)
of an ideal gas without gravitation (cf. e.g., [2, Sect. 4.4, p.
106]).⁶ This is because the Gaussian channel, considered in

⁶ The classical heat capacity per particle of an ideal gas at constant volume
is actually $C = 3k/2$. The extra factor of 3 accounts for three degrees of
freedom per particle, owing to the three dimensions of space.



this example, induces a quadratic Hamiltonian, just like that 
of the ideal gas (cf. the first term of eq. ([T]i). 
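
The identification $C(T) = k/2$ can be checked numerically. The sketch below assumes, as the entropy expressions in this example imply, that the noise seen from $U$ to $X$ has variance $1/\beta_0$ and the noise seen from $V$ to $X$ has variance $1/\beta_1$; the specific values of $\beta_0$, $\beta_1$ and $k$ are hypothetical. It compares the difference of differential entropies with the integral of $C(T)/(kT)$ over temperature.

```python
import math


def entropy_gap_nats(beta0, beta1):
    """h(X|V) - h(X|U) for Gaussian noise variances 1/beta1 and 1/beta0."""
    h_x_given_v = 0.5 * math.log(2.0 * math.pi * math.e / beta1)
    h_x_given_u = 0.5 * math.log(2.0 * math.pi * math.e / beta0)
    return h_x_given_v - h_x_given_u


def simpson(f, a, b, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(a + i * h)
    return total * h / 3.0


k = 1.0                    # Boltzmann constant in arbitrary units (assumed)
beta0, beta1 = 2.0, 0.5    # assumed inverse temperatures, beta1 < beta0
T0, T1 = 1.0 / (k * beta0), 1.0 / (k * beta1)

heat_capacity = k / 2.0    # the constant heat capacity identified in the text
gap = entropy_gap_nats(beta0, beta1)
integral = simpson(lambda T: heat_capacity / (k * T), T0, T1)
```

With these values both `gap` and `integral` equal $\frac{1}{2}\ln(\beta_0/\beta_1) = \ln 2$ up to the accuracy of the quadrature.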

It is instructive to examine also the case where the directions
of the additive channels are reversed, or equivalently, to examine
the difference $I(U;V) - I(X;V)$ for the original channels
defined above. Adopting the latter definition, and using the
main results of [8], concerning the relation between $I(U;V)$
and the minimum mean square error (MMSE), $\mathrm{mmse}(V|U)$,
in estimating $V$ from $U$ (and of course, similar relations for
$X$ and $V$), we find that the increase in entropy is:

$$\begin{aligned}
I(U;V) - I(X;V) &= \frac{1}{2}\int_{\beta_1}^{\beta_0} d\beta\,\mathrm{mmse}(V|\sqrt{\beta}V+N) \\
&= \frac{1}{2}\int_{\beta_1}^{\beta_0} d\beta\,\mathrm{mmse}(V|V+N/\sqrt{\beta}) \\
&= \int_{T_0}^{T_1} dT\,\frac{\mathrm{mmse}(V|V+N\sqrt{kT})}{2kT^2},
\end{aligned} \tag{17}$$



where $N \sim \mathcal{N}(0,1)$. Thus, now we identify the heat capacity as

$$C(T) = \frac{\mathrm{mmse}(V|V+N\sqrt{kT})}{2T}.$$

If, in addition, $V$ is zero-mean, Gaussian, with variance $\sigma_V^2$,
then

$$C(T) = \frac{k\sigma_V^2}{2(\sigma_V^2+kT)}.$$

In the high-SNR regime ($\sigma_V^2 \gg kT$), this gives $C(T) \approx k/2$,
which is the same as before.
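
For Gaussian $V$ the heat capacity above can be cross-checked against the closed-form mutual information $\frac{1}{2}\ln(1+\sigma_V^2/(kT))$. The following sketch is only an illustration under assumed values of $\sigma_V^2$, $k$, $T_0$ and $T_1$ (none of which come from the paper), with $kT_0$ and $kT_1$ taken as the noise variances seen from $V$ to $U$ and from $V$ to $X$, respectively.

```python
import math


def mmse_gaussian(sigma2, noise_var):
    """MMSE of V ~ N(0, sigma2) observed in Gaussian noise of variance noise_var."""
    return sigma2 * noise_var / (sigma2 + noise_var)


def heat_capacity(T, sigma2, k=1.0):
    """C(T) = mmse(V | V + N*sqrt(kT)) / (2T)."""
    return mmse_gaussian(sigma2, k * T) / (2.0 * T)


def mutual_info(sigma2, noise_var):
    """I(V; V + noise) in nats for Gaussian V and Gaussian noise."""
    return 0.5 * math.log(1.0 + sigma2 / noise_var)


def simpson(f, a, b, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(a + i * h)
    return total * h / 3.0


k, sigma2 = 1.0, 4.0   # assumed constants
T0, T1 = 0.5, 2.0      # assumed temperatures, T1 > T0

info_gap = mutual_info(sigma2, k * T0) - mutual_info(sigma2, k * T1)
entropy_integral = simpson(lambda T: heat_capacity(T, sigma2, k) / (k * T), T0, T1)
high_snr_capacity = heat_capacity(0.001, sigma2, k)
```

Both `info_gap` and `entropy_integral` evaluate to $\frac{1}{2}\ln 3$ for these values, and `high_snr_capacity` is close to $k/2$, in line with the high-SNR remark.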

Example 4 - binary symmetric degraded broadcast channel: 
In a similar manner, consider the binary symmetric degraded 
broadcast channel, that is, a cascade of two binary symmetric 
channels, 

$$X = U \oplus N_1, \qquad U = V \oplus N_2,$$

where all RV's are binary $\{0,1\}$, $\oplus$ designates addition
modulo 2, and $(V, N_1, N_2)$ are independent. In this case,
the Hamiltonian is $\mathcal{E}(x) = E_0 x$, $x \in \{0,1\}$, where $E_0$ is
a constant (having the units of energy), and we have



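$$\Pr\{N_1 = x\} = \frac{e^{-\beta_0 E_0 x}}{1 + e^{-\beta_0 E_0}}, \quad x \in \{0,1\},$$

and similarly,

$$\Pr\{N_1 \oplus N_2 = x\} = \frac{e^{-\beta_1 E_0 x}}{1 + e^{-\beta_1 E_0}}.$$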

Here the heat capacity can be shown to be given by: 



$$C(T) = \frac{E_0^2\,e^{-E_0/(kT)}}{kT^2\left[1 + e^{-E_0/(kT)}\right]^2},$$

which agrees with the heat capacity of a system of two-level 
non-interacting particles (see, e.g. [2, Sect. 4.3, eq. (4.22)]). 
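
The two-level heat capacity can be verified by differentiating the entropy numerically, since $C(T) = T\,dS/dT$ with $S = k\Sigma$. The sketch below assumes $E_0 = k = 1$ in arbitrary units and evaluates both expressions at a single, arbitrarily chosen temperature; the function names are illustrative.

```python
import math


def p_one(T, E0=1.0, k=1.0):
    """Probability of the state x = 1 at temperature T."""
    w = math.exp(-E0 / (k * T))
    return w / (1.0 + w)


def entropy_nats(T, E0=1.0, k=1.0):
    """Binary entropy (in nats) of the two-level Boltzmann distribution."""
    p = p_one(T, E0, k)
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def heat_capacity_formula(T, E0=1.0, k=1.0):
    """Closed-form heat capacity of a two-level system."""
    w = math.exp(-E0 / (k * T))
    return E0 ** 2 * w / (k * T ** 2 * (1.0 + w) ** 2)


def heat_capacity_numeric(T, E0=1.0, k=1.0, h=1e-5):
    """C(T) = T * dS/dT with S = k * entropy, via a central difference."""
    dS = k * (entropy_nats(T + h, E0, k) - entropy_nats(T - h, E0, k))
    return T * dS / (2.0 * h)


T = 1.0  # assumed temperature, arbitrary units
c_formula = heat_capacity_formula(T)
c_numeric = heat_capacity_numeric(T)
```

The two values agree to within the accuracy of the finite difference.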



V. Error Exponents and Reversible Processes 

We mentioned the notion of a reversible process, and the 
question that might naturally arise, at this point, concerns the 
information-theoretic analogue of this term. This seems to 
have a direct relationship to the behavior of error exponents 
of hypothesis testing and the Neyman-Pearson lemma: Let 
$P_0(x)$ and $P_1(x)$ be two probability distributions (or densities,
in the continuous case) of a random variable $X$, taking values
in an alphabet $\mathcal{X}$. Given an observation $x \in \mathcal{X}$, one would
like to decide whether it emerged from $P_0$ or $P_1$. A decision
rule is a partition of $\mathcal{X}$ into two complementary regions $\mathcal{X}_0$
and $\mathcal{X}_1$, such that whenever $X \in \mathcal{X}_i$ one decides in favor
of the hypothesis that $X$ has emerged from $P_i$, $i = 0,1$.
Associated with any decision rule, there are two kinds of error
probabilities: $P_0(\mathcal{X}_1)$ is the probability of deciding in favor
of $P_1$ while $x$ has actually been generated by $P_0$, and $P_1(\mathcal{X}_0)$ is
the opposite kind of error. The Neyman-Pearson problem is
about the quest for the optimum decision rule in the sense of
minimizing $P_1(\mathcal{X}_0)$ subject to the constraint that $P_0(\mathcal{X}_1) \le \alpha$
for a prescribed constant $\alpha \in [0,1]$. The Neyman-Pearson
lemma asserts that the optimum decision rule, in this sense,
is given by the likelihood ratio test (LRT) $\mathcal{X}_0^* = (\mathcal{X}_1^*)^c =
\{x : P_0(x)/P_1(x) \ge \mu\}$, where the threshold $\mu = \mu(\alpha)$ is
tuned so as to meet the constraint $P_0(\mathcal{X}_1) \le \alpha$ with equality
(assuming that this is possible).
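
The following minimal sketch illustrates the Neyman-Pearson lemma on a four-letter alphabet. The distributions `P0` and `P1` and the threshold `mu` are hypothetical, chosen so that the constraint is met with equality by a deterministic test; in general, randomization may be needed for that. The LRT is compared with an exhaustive search over all decision regions with the same type-I error level.

```python
from itertools import product

# Hypothetical distributions on a four-letter alphabet (illustrative only).
P0 = [0.5, 0.25, 0.15, 0.10]
P1 = [0.1, 0.2, 0.3, 0.4]


def lrt_region(p0, p1, mu):
    """Letters decided in favor of P0: those with P0(x)/P1(x) >= mu."""
    return {x for x in range(len(p0)) if p0[x] >= mu * p1[x]}


def error_probs(region0, p0, p1):
    """Return (P0 of the complement of region0, P1 of region0)."""
    type1 = sum(p0[x] for x in range(len(p0)) if x not in region0)
    type2 = sum(p1[x] for x in region0)
    return type1, type2


def best_type2_by_search(p0, p1, alpha, tol=1e-12):
    """Smallest P1(region0) over all regions with P0(complement) <= alpha."""
    best = None
    for mask in product([False, True], repeat=len(p0)):
        region0 = {x for x, keep in enumerate(mask) if keep}
        type1, type2 = error_probs(region0, p0, p1)
        if type1 <= alpha + tol and (best is None or type2 < best):
            best = type2
    return best


mu = 1.0  # assumed threshold
region0 = lrt_region(P0, P1, mu)
alpha, type2_lrt = error_probs(region0, P0, P1)
type2_search = best_type2_by_search(P0, P1, alpha)
```

For this choice, the LRT region attains the smallest type-II error among all deterministic rules that satisfy the same constraint.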

Assume now that instead of one observation $x$, we have a
vector $\mathbf{x}$ of $n$ i.i.d. observations $(x_1, \ldots, x_n)$, emerging either
all from $P_0$, or all from $P_1$. In this case, the error probabilities
of the two kinds, pertaining to the LRT, $P_0(\mathbf{x})/P_1(\mathbf{x}) \ge \alpha$,
can decay asymptotically exponentially, provided that $\alpha = \alpha_n$
is chosen to decay exponentially with $n$ (though not too fast),
and the asymptotic exponents, $e_0 = \lim_{n\to\infty}[-\frac{1}{n}\ln P_0(\mathcal{X}_1^*)]$
and $e_1 = \lim_{n\to\infty}[-\frac{1}{n}\ln P_1(\mathcal{X}_0^*)]$ can be easily found (e.g.,
by using the method of types) to be

$$e_i(\lambda) = D(P_\lambda\|P_i) = \sum_{x\in\mathcal{X}} P_\lambda(x)\ln\frac{P_\lambda(x)}{P_i(x)}, \quad i = 0,1,$$

where

$$P_\lambda(x) = \frac{P_0^{1-\lambda}(x)P_1^{\lambda}(x)}{Z(\lambda)}$$

with

$$Z(\lambda) = \sum_{x\in\mathcal{X}} P_0^{1-\lambda}(x)P_1^{\lambda}(x),$$

and $\lambda \in [0,1]$ being a parameter (depending on $\mu$) that controls
the tradeoff between the error exponents of the two kinds: For
$\lambda = 0$, $e_0(0) = 0$ and $e_1(0) = D(P_0\|P_1)$. As $\lambda$ grows from
0 to 1, $e_0(\lambda)$ increases and $e_1(\lambda)$ decreases. Finally, for $\lambda = 1$,
$e_0(1) = D(P_1\|P_0)$ and $e_1(1) = 0$.
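
The tradeoff between the two exponents can be traced numerically. The sketch below computes the tilted distribution $P_\lambda$ and the pair $(e_0(\lambda), e_1(\lambda))$ on a grid of $\lambda$ values, for a hypothetical pair of distributions `P0` and `P1` that is not taken from the paper.

```python
import math

# Hypothetical distributions on a four-letter alphabet (illustrative only).
P0 = [0.5, 0.25, 0.15, 0.10]
P1 = [0.1, 0.2, 0.3, 0.4]


def tilted(lam, p0, p1):
    """P_lambda(x), proportional to p0(x)^(1-lam) * p1(x)^lam."""
    weights = [a ** (1.0 - lam) * b ** lam for a, b in zip(p0, p1)]
    z = sum(weights)
    return [w / z for w in weights]


def kl(p, q):
    """Relative entropy D(p||q) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)


def exponents(lam, p0, p1):
    """Return (e0(lam), e1(lam)) = (D(P_lam||P0), D(P_lam||P1))."""
    p_lam = tilted(lam, p0, p1)
    return kl(p_lam, p0), kl(p_lam, p1)


grid = [i / 10.0 for i in range(11)]
curve = [exponents(lam, P0, P1) for lam in grid]
```

The first and last entries of `curve` reproduce the endpoint values stated above, and the entries in between show $e_0$ growing while $e_1$ shrinks.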

From the physics point of view, given $P_0$ and $P_1$, let us
define the Hamiltonians, $\mathcal{E}_0(x) = -\ln P_0(x)$ and $\mathcal{E}_1(x) =
-\ln P_1(x)$, and let the inverse temperature be set to $\beta = 1$.
Let $P_\lambda(x)$ be defined as above, which can be referred to as
the Boltzmann distribution with Hamiltonian $\mathcal{E}_\lambda(x) = (1 -
\lambda)\mathcal{E}_0(x) + \lambda\mathcal{E}_1(x)$ and $\beta = 1$. Let $\lambda_t$, $t \in [0, \tau]$, be a function
that starts from $\lambda_0 = 0$ and ends at $\lambda_\tau = 1$. Now, assuming
that the conditions for the Jarzynski equality hold in this case,
the average work along the process, which is

$$E\{W\} = \int_0^\tau d\lambda_t \cdot E_{\lambda_t}\{\mathcal{E}_1(X) - \mathcal{E}_0(X)\},$$

cannot be smaller than $\Delta F$, which in this case vanishes. As
said, equality $E\{W\} = \Delta F = 0$ is attained for a reversible
process.

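Indeed, these relations can easily be seen to hold here
and also be related to the error exponents of Neyman-Pearson
testing, and even from a direct derivation, without
recourse to physical considerations: Considering the Hamiltonians
$\mathcal{E}_i(x) = -\ln P_i(x)$, $i = 0,1$, as mentioned above, we
have:

$$\begin{aligned}
E\{W\} &= \int_0^\tau d\lambda_t\, E_{\lambda_t}\left\{\ln\frac{P_0(X)}{P_1(X)}\right\} \\
&= \int_0^\tau d\lambda_t \sum_{x} P_{\lambda_t}(x)\ln\frac{P_0(x)}{P_1(x)} \\
&= \int_0^\tau d\lambda_t\,[D(P_{\lambda_t}\|P_1) - D(P_{\lambda_t}\|P_0)] \\
&= \int_0^\tau d\lambda_t\,[e_1(\lambda_t) - e_0(\lambda_t)].
\end{aligned} \tag{18}$$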



On the other hand, we can also rewrite the second line of the 
last chain of equalities as: 



$$E\{W\} = -\int_0^\tau d\lambda_t \cdot \left[\frac{\partial \ln Z(\lambda)}{\partial\lambda}\right]_{\lambda=\lambda_t}. \tag{19}$$



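Now, if $\{\lambda_t\}$ is everywhere differentiable (which is analogous
to a reversible process), this amounts to

$$\begin{aligned}
E\{W\} &= -\int_0^\tau dt\,\dot{\lambda}_t \left[\frac{\partial \ln Z(\lambda)}{\partial\lambda}\right]_{\lambda=\lambda_t} \\
&= -\int_0^\tau dt\,\frac{d\ln Z(\lambda_t)}{dt} \\
&= \ln Z(\lambda_0) - \ln Z(\lambda_\tau) \\
&= \ln Z(0) - \ln Z(1) \\
&= \ln 1 - \ln 1 = 0.
\end{aligned} \tag{20}$$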



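If, on the other hand, $\{\lambda_t\}$ contains jump-discontinuities, then
every such jump, say, from $\lambda_1$ to $\lambda_2$, contributes to the integral
a term of the form

$$d\lambda_t \left[\frac{\partial \ln Z(\lambda)}{\partial\lambda}\right]_{\lambda=\lambda_t} = (\lambda_2 - \lambda_1)\left[\frac{\partial \ln Z(\lambda)}{\partial\lambda}\right]_{\lambda=\lambda_1},$$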



which is smaller than $\ln Z(\lambda_2) - \ln Z(\lambda_1)$, due to the convexity
of the function $\ln Z(\lambda)$. Consequently, because of the minus
sign, each such discontinuity increases $E\{W\}$ above zero.
Thus, we indeed see that,

$$\int_0^\tau d\lambda_t\, e_1(\lambda_t) \ge \int_0^\tau d\lambda_t\, e_0(\lambda_t)$$

with equality in the differentiable (reversible) case. This in
turn means that in this case,

$$\int_0^\tau dt\,\dot{\lambda}_t\, e_0(\lambda_t) = \int_0^\tau dt\,\dot{\lambda}_t\, e_1(\lambda_t).$$

The left- (resp. right-) hand side is simply $\int_0^1 d\lambda\, e_0(\lambda)$ (resp.
$\int_0^1 d\lambda\, e_1(\lambda)$), which means that the areas under the graphs of
the functions $e_0$ and $e_1$ are always the same.
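
The equal-area property, and the strict positivity of the average work under an abrupt change of $\lambda$, can both be checked numerically. The sketch below reuses a hypothetical pair of distributions (not from the paper), integrates $e_0$ and $e_1$ over $[0,1]$ with Simpson's rule, and evaluates the work for a schedule consisting of a single jump from $\lambda = 0$ to $\lambda = 1$.

```python
import math

# Hypothetical distributions on a four-letter alphabet (illustrative only).
P0 = [0.5, 0.25, 0.15, 0.10]
P1 = [0.1, 0.2, 0.3, 0.4]


def tilted(lam, p0, p1):
    """P_lambda(x), proportional to p0(x)^(1-lam) * p1(x)^lam."""
    weights = [a ** (1.0 - lam) * b ** lam for a, b in zip(p0, p1)]
    z = sum(weights)
    return [w / z for w in weights]


def kl(p, q):
    """Relative entropy D(p||q) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)


def simpson(f, a, b, n=1000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(a + i * h)
    return total * h / 3.0


area_e0 = simpson(lambda lam: kl(tilted(lam, P0, P1), P0), 0.0, 1.0)
area_e1 = simpson(lambda lam: kl(tilted(lam, P0, P1), P1), 0.0, 1.0)

# Differentiable schedule: the average work is the difference of the areas.
work_smooth = area_e1 - area_e0

# Single jump from lambda = 0 to lambda = 1: the whole increment (equal to 1)
# is weighted by e1 - e0 evaluated at the starting point lambda = 0.
work_jump = kl(P0, P1) - kl(P0, P0)
```

For the smooth schedule the work vanishes up to quadrature error, whereas the single jump yields $D(P_0\|P_1) > 0$, consistent with the convexity argument.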

While these integral relations between the error exponent 
functions have actually been derived without recourse to any 
physical considerations, it is the physical point of view that 
gives the trigger to point out these relations. 